Kernel name: "NYC Taxi EDA - Update: The fast & the curious"
Link: https://www.kaggle.com/headsortails/nyc-taxi-eda-update-the-fast-the-curious
In this project we explore the kernel above: we show why the author chose the functions used, compare them with alternatives, and highlight the kernel's most interesting points.
library("data.table")
library("tibble")
train <- as_tibble(fread('train.csv'))  # as.tibble() is deprecated in favour of as_tibble()
In base R, read.csv is the standard function for loading a data.frame from a CSV file. With a very large file, however, it can take a long time to run.
print(paste("In this case the dataset is quite huge:",dim(train)[1], "rows and",
dim(train)[2], "columns."))
## [1] "In this case the dataset is quite huge: 1458644 rows and 11 columns."
So here the author used fread, which is much faster than read.csv (check the time of each function using profvis!).
Another function worth comparing is load. It restores objects that have been stored in an .RData file, and it typically runs faster than both read.csv and fread.
When is it a good idea to use load? When a background process can update the data.frame and save it to an .RData file.
Let's take a look at these possibilities:
library("profvis")
library("data.table")
library("tibble")
library("readr")
profvis({
# fread
train <- fread("train.csv")
test <- fread("test.csv")
# read.csv
train_readcsv <- read.csv("train.csv")
# read_csv -> from "readr" package
train_read_csv <- read_csv("train.csv")
# as_tibble (as.tibble is deprecated)
train <- as_tibble(train)
test <- as_tibble(test)
# loading RData
save(train_readcsv, file = "train_data.RData")
rm(train_readcsv)
load(file = "train_data.RData")
})
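profvis opens an interactive viewer, which is not always convenient. As a lighter alternative, the same comparison can be sketched with base R's system.time. The data below is a synthetic stand-in written to a temporary file (the column names and the row count are made up for illustration), so the absolute timings will differ from those on train.csv.

```r
library(data.table)

# Synthetic stand-in for train.csv (hypothetical columns, not the taxi data)
set.seed(1)
n <- 2e5
toy <- data.frame(id = sprintf("id%07d", seq_len(n)),
                  x = rnorm(n),
                  y = sample.int(10L, n, replace = TRUE))

csv_path <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")
fwrite(toy, csv_path)          # write the CSV once
save(toy, file = rdata_path)   # and the same object as .RData

# Load into a separate environment so the restored object does not
# overwrite 'toy' in the workspace
restored <- new.env()

timings <- c(
  read.csv = system.time(a <- read.csv(csv_path))[["elapsed"]],
  fread    = system.time(b <- fread(csv_path))[["elapsed"]],
  load     = system.time(load(rdata_path, envir = restored))[["elapsed"]]
)
print(timings)  # elapsed seconds per reader
```

A single run is noisy; repeating each call a few times gives a fairer picture.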
All the information below was "grepped" from https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html
Tibbles
"Tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors)."
Major points:
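Two of the differences between tibbles and data frames can be checked directly. The sketch below uses a hand-made toy table (the column names abc and xyz are hypothetical) and only illustrates the behaviour; it is not taken from the kernel.

```r
library(tibble)

# Hypothetical toy data; 'abc' and 'xyz' are made-up column names
df <- data.frame(abc = 1:3, xyz = c("a", "b", "c"))
tb <- as_tibble(df)

# Single-column subsetting with `[`: a data.frame drops to a vector,
# a tibble stays a tibble
class(df[, 1])
class(tb[, 1])

# Partial matching with `$`: a data.frame matches 'ab' to 'abc',
# a tibble returns NULL (and warns about the unknown column)
df$ab
suppressWarnings(tb$ab)
```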
A brief overview of the data summarises the descriptive statistics of each column and helps to detect abnormal values or outliers.
For the summaries:
summary(train)
## id vendor_id pickup_datetime dropoff_datetime
## Length:1458644 Min. :1.000 Length:1458644 Length:1458644
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :2.000 Mode :character Mode :character
## Mean :1.535
## 3rd Qu.:2.000
## Max. :2.000
## passenger_count pickup_longitude pickup_latitude dropoff_longitude
## Min. :0.000 Min. :-121.93 Min. :34.36 Min. :-121.93
## 1st Qu.:1.000 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: -73.99
## Median :1.000 Median : -73.98 Median :40.75 Median : -73.98
## Mean :1.665 Mean : -73.97 Mean :40.75 Mean : -73.97
## 3rd Qu.:2.000 3rd Qu.: -73.97 3rd Qu.:40.77 3rd Qu.: -73.96
## Max. :9.000 Max. : -61.34 Max. :51.88 Max. : -61.34
## dropoff_latitude store_and_fwd_flag trip_duration
## Min. :32.18 Length:1458644 Min. : 1
## 1st Qu.:40.74 Class :character 1st Qu.: 397
## Median :40.75 Mode :character Median : 662
## Mean :40.75 Mean : 959
## 3rd Qu.:40.77 3rd Qu.: 1075
## Max. :43.92 Max. :3526282
summary(test)
## id vendor_id pickup_datetime passenger_count
## Length:625134 Min. :1.000 Length:625134 Min. :0.000
## Class :character 1st Qu.:1.000 Class :character 1st Qu.:1.000
## Mode :character Median :2.000 Mode :character Median :1.000
## Mean :1.535 Mean :1.662
## 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :9.000
## pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
## Min. :-121.93 Min. :37.39 Min. :-121.93 Min. :36.60
## 1st Qu.: -73.99 1st Qu.:40.74 1st Qu.: -73.99 1st Qu.:40.74
## Median : -73.98 Median :40.75 Median : -73.98 Median :40.75
## Mean : -73.97 Mean :40.75 Mean : -73.97 Mean :40.75
## 3rd Qu.: -73.97 3rd Qu.:40.77 3rd Qu.: -73.96 3rd Qu.:40.77
## Max. : -69.25 Max. :42.81 Max. : -67.50 Max. :48.86
## store_and_fwd_flag
## Length:625134
## Class :character
## Mode :character
##
##
##
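The summary of train already points at a few suspicious values: the maximum trip_duration of 3526282 seconds is roughly 41 days, and the minimum longitude of -121.93 lies far from New York. A minimal sketch of how such rows could be flagged follows. The toy table is hand-made from the extremes shown above, and the thresholds (24 hours and a rough bounding box) are assumptions for illustration, not values from the kernel.

```r
library(dplyr)

# Toy rows: 't1' mimics a typical trip, 't2' and 't3' are built from the
# extremes in the summary output (the combination per row is made up)
toy <- tibble(
  id = c("t1", "t2", "t3"),
  pickup_longitude = c(-73.98, -121.93, -73.97),
  pickup_latitude = c(40.75, 34.36, 40.77),
  trip_duration = c(662L, 397L, 3526282L)
)

# Assumed thresholds: 24 hours, and a rough box around New York
max_duration <- 24 * 3600
suspicious <- toy %>%
  mutate(too_long = trip_duration > max_duration,
         far_away = pickup_longitude < -75 | pickup_longitude > -72 |
                    pickup_latitude < 40 | pickup_latitude > 42) %>%
  filter(too_long | far_away)

print(suspicious)
```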
Data overview
library("dplyr")
glimpse(train)
## Observations: 1,458,644
## Variables: 11
## $ id <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
glimpse(test)
## Observations: 625,134
## Variables: 9
## $ id <chr> "id3004672", "id3505355", "id1217141", "id2...
## $ vendor_id <int> 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1...
## $ pickup_datetime <chr> "2016-06-30 23:59:58", "2016-06-30 23:59:53...
## $ passenger_count <int> 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 4, 1, 1, 1, 1...
## $ pickup_longitude <dbl> -73.98813, -73.96420, -73.99744, -73.95607,...
## $ pickup_latitude <dbl> 40.73203, 40.67999, 40.73758, 40.77190, 40....
## $ dropoff_longitude <dbl> -73.99017, -73.95981, -73.98616, -73.98643,...
## $ dropoff_latitude <dbl> 40.75668, 40.65540, 40.72952, 40.73047, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
Another popular way to get a data overview is str. It is very similar to glimpse, but it shows fewer values per column.
str(train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1458644 obs. of 11 variables:
## $ id : chr "id2875421" "id2377394" "id3858529" "id3504673" ...
## $ vendor_id : int 2 1 2 2 2 2 1 2 1 2 ...
## $ pickup_datetime : chr "2016-03-14 17:24:55" "2016-06-12 00:43:35" "2016-01-19 11:35:24" "2016-04-06 19:32:31" ...
## $ dropoff_datetime : chr "2016-03-14 17:32:30" "2016-06-12 00:54:38" "2016-01-19 12:10:48" "2016-04-06 19:39:40" ...
## $ passenger_count : int 1 1 1 1 1 6 4 1 1 1 ...
## $ pickup_longitude : num -74 -74 -74 -74 -74 ...
## $ pickup_latitude : num 40.8 40.7 40.8 40.7 40.8 ...
## $ dropoff_longitude : num -74 -74 -74 -74 -74 ...
## $ dropoff_latitude : num 40.8 40.7 40.7 40.7 40.8 ...
## $ store_and_fwd_flag: chr "N" "N" "N" "N" ...
## $ trip_duration : int 455 663 2124 429 435 443 341 1551 255 1225 ...
## - attr(*, ".internal.selfref")=<externalptr>
To avoid drawing wrong conclusions from the data, the missing values should be analysed to measure their impact on the whole dataset.
As a rule of thumb, if the cases with missing values make up less than 5% of the sample, the researcher can drop them.
For more info about this subject: https://www.statisticssolutions.com/missing-values-in-data/
Luckily there are no missing values in the data (easy mode):
sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0
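sum(is.na(...)) gives only a single total. When missing values do occur, a per-column breakdown makes it easier to judge them against a threshold such as the 5% rule mentioned above. A minimal sketch on a made-up toy data frame (the NA positions are deliberate and do not come from the taxi data):

```r
# Toy data with a few deliberately missing values
toy <- data.frame(
  passenger_count = c(1L, NA, 2L, 1L),
  trip_duration = c(455L, 663L, NA, NA),
  vendor_id = c(2L, 1L, 2L, 2L)
)

na_count <- colSums(is.na(toy))            # missing values per column
na_share <- colMeans(is.na(toy))           # same, as a fraction of the rows
incomplete <- mean(!complete.cases(toy))   # fraction of rows with any NA

print(na_count)
print(na_share)
print(incomplete)
```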
Here the author made an interesting move: he combined the train and test data sets into a single one, so that the exploration is not tailored to just one part of the data.
CAUTION: we combine the two data sets only to get a better overview; for building a machine learning model, train and test should be kept separate.
# For the test data set, mutate adds dset plus empty dropoff_datetime and trip_duration columns
# For the train data set, mutate adds only dset
# bind_rows then stacks the two data sets into one
combine <- bind_rows(train %>% mutate(dset = "train"),
test %>% mutate(dset = "test",
dropoff_datetime = NA,
trip_duration = NA))
combine <- combine %>% mutate(dset = factor(dset))
glimpse(combine)
## Observations: 2,083,778
## Variables: 12
## $ id <chr> "id2875421", "id2377394", "id3858529", "id3...
## $ vendor_id <int> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2...
## $ pickup_datetime <chr> "2016-03-14 17:24:55", "2016-06-12 00:43:35...
## $ dropoff_datetime <chr> "2016-03-14 17:32:30", "2016-06-12 00:54:38...
## $ passenger_count <int> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1...
## $ pickup_longitude <dbl> -73.98215, -73.98042, -73.97903, -74.01004,...
## $ pickup_latitude <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40....
## $ dropoff_longitude <dbl> -73.96463, -73.99948, -74.00533, -74.01227,...
## $ dropoff_latitude <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40....
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration <int> 455, 663, 2124, 429, 435, 443, 341, 1551, 2...
## $ dset <fct> train, train, train, train, train, train, t...
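Because of the caution above, it helps that the dset column makes the split reversible. The sketch below rebuilds a tiny combined table from made-up rows and separates it again. The toy tables and the names train_again and test_again are hypothetical.

```r
library(dplyr)

# Made-up miniature versions of train and test
toy_train <- tibble(id = c("a", "b"), trip_duration = c(455L, 663L))
toy_test <- tibble(id = "c")

combine <- bind_rows(toy_train %>% mutate(dset = "train"),
                     toy_test %>% mutate(dset = "test",
                                         trip_duration = NA_integer_)) %>%
  mutate(dset = factor(dset))

# Recover the original parts before any modelling
train_again <- combine %>% filter(dset == "train") %>% select(-dset)
test_again <- combine %>% filter(dset == "test") %>%
  select(-dset, -trip_duration)   # the target is unknown for test

print(train_again)
print(test_again)
```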
For the following analysis, we will turn the date and time columns from characters into date-time objects. We also recode vendor_id and passenger_count as factors. This makes it easier to visualise relationships that involve these features.
library('lubridate')
train <- train %>%
mutate(pickup_datetime = ymd_hms(pickup_datetime),
dropoff_datetime = ymd_hms(dropoff_datetime),
vendor_id = factor(vendor_id),
passenger_count = factor(passenger_count))
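To see what the conversion buys us, here is a small sketch on the two timestamps of the first row shown in the glimpse(train) output. Once parsed, the values support date arithmetic and component extraction, which plain character strings do not.

```r
library(lubridate)

# Timestamps of the first row in the glimpse(train) output
pickup <- ymd_hms("2016-03-14 17:24:55")
dropoff <- ymd_hms("2016-03-14 17:32:30")

class(pickup)   # POSIXct date-time; ymd_hms uses UTC unless tz is given
hour(pickup)    # hour of day as a number

# Elapsed time between the two timestamps, in seconds
as.numeric(difftime(dropoff, pickup, units = "secs"))
```

The elapsed time is 455 seconds, which matches the trip_duration of that row.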
ASSUME NOTHING! It is worth checking whether the trip_duration values are consistent with the intervals between pickup_datetime and dropoff_datetime. Presumably the former were directly computed from the latter, but you never know. Below, the check variable shows "TRUE" if the two intervals are not consistent:
train %>%
mutate(check = abs(int_length(interval(dropoff_datetime,pickup_datetime)) + trip_duration) > 0) %>%
select(check, pickup_datetime, dropoff_datetime, trip_duration) %>%
group_by(check) %>%
count()
## # A tibble: 1 x 2
## # Groups: check [1]
## check n
## <lgl> <int>
## 1 FALSE 1458644
And we find that everything fits perfectly.
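Since the real data never triggers the check, a toy example with one deliberately wrong row shows what a failure would look like. This sketch uses difftime instead of the interval arithmetic above, as an equivalent formulation rather than the author's code. The second trip_duration is intentionally off by 10 seconds.

```r
library(dplyr)
library(lubridate)

# Toy data: row 1 is consistent, row 2 is deliberately off by 10 s
toy <- tibble(
  pickup_datetime = ymd_hms(c("2016-03-14 17:24:55", "2016-06-12 00:43:35")),
  dropoff_datetime = ymd_hms(c("2016-03-14 17:32:30", "2016-06-12 00:54:38")),
  trip_duration = c(455L, 673L)
)

checked <- toy %>%
  mutate(computed = as.numeric(difftime(dropoff_datetime, pickup_datetime,
                                        units = "secs")),
         check = computed != trip_duration)   # TRUE marks an inconsistency

print(checked)
```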
# 2. Individual feature visualisations
library(leaflet)
library(leaflet.extras)
set.seed(1234)
foo <- sample_n(train, 8e3)
leaflet(data = foo) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
addCircleMarkers(~pickup_longitude, ~pickup_latitude, radius = 1,
color = "blue", fillOpacity = 0.3)